posted 08-25-2006 09:15 PM
Anyone care to check me on this? At the recent APA conference, I spoke with Jaimie Brown and Tyler Buttle about their demo scoring tools. One of the tools seemed to measure and display the Kircher feature scores of the relevant and control questions. The demo tool also computed ratio values for each question – which I assumed to be based on the Miritello article from 1999.
One of the things I've felt is missing from our current scoring arsenal is an adequate mathematical description of our probability models, consistent with the types of inferential statistics taught in every graduate school in the country. It's not that our work is antithetical to common statistical models, but that we've not taken enough time to describe our models in the common empirical languages. Increasingly, that absence gets translated into “polygraph lacks validity.” What may be accurate about that criticism is that our field lacks a description of polygraph validity in the empirical vocabularies common to other sciences.
In the interest of filling in some of those gaps, I propose that we could – and I'd like to obtain a dataset to do so – establish the statistical significance of polygraph scores (i.e., differences between relevant and comparison question scores) using common statistical models.
In the case of single issue tests, we would most likely be interested in the beloved (by some) t-test of significance – commonly used in hypothesis testing (sorry Barry) with independent samples, when sample sizes are small (n < 30) and the population standard deviation is not known. For larger samples (n > 30) we would use a z-test, based on the standard normal (Gaussian) distribution. The t-test is similar but uses the t-distribution, which adjusts for sample size (degrees of freedom), and was published in 1908 by Gosset, a statistician employed by the Guinness Brewery in Dublin, Ireland, who wrote under the pseudonym “Student.”
For single issue tests, the t-test allows the investigator to establish the threshold alpha (level of significance) at which comparison and relevant question scores are evaluated for the statistical significance of their difference, based on the degrees of freedom in the data (more on df below). In the case of a single issue test, the number of data points in each set equals the number of questions in that set times the number of charts.
Note that this is different from what is implied by Miritello (1999), whose procedure specifies tabulating the ratio of each relevant question across all charts. While the logical assumption is obvious – that all relevant questions in a single issue test represent the same concern – each iteration of each relevant question is actually a distinct empirical data point. The iterations should therefore not be aggregated question by question, but described by the means and deviation scores of the whole sets.
In this manner, the data points for the relevant and comparison sets (as independent samples) are aggregated into mean and deviation scores for the two sets, where N1 = the number of relevant data points and N2 = the number of comparison data points. For example, three relevant and three comparison questions over three charts give N1 = N2 = 9, as sketched below.
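To make the aggregation concrete, here is a minimal sketch in Python (the scores are invented for illustration, not real data): nine relevant and nine comparison data points, flattened across charts rather than averaged within questions.

```python
import numpy as np

# Hypothetical Kircher feature scores (invented for illustration):
# rows = charts, columns = questions.
relevant = np.array([[-2.1, -1.4, -0.9],
                     [-1.8, -0.6, -1.2],
                     [-2.4, -1.1, -0.7]])
comparison = np.array([[0.8, 1.3, 0.5],
                       [1.1, 0.4, 0.9],
                       [0.6, 1.0, 1.2]])

# Each iteration of each question is a distinct data point, so flatten
# across charts instead of averaging each question across charts.
r = relevant.ravel()    # N1 = 9 relevant data points
c = comparison.ravel()  # N2 = 9 comparison data points

print(r.mean(), r.std(ddof=1))  # mean and sample st dev of the relevant set
print(c.mean(), c.std(ddof=1))  # mean and sample st dev of the comparison set
```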
So, to calculate the significance of the difference between the relevant and comparison questions:

t = (mean of sample 1 – mean of sample 2) / (Sp x sqrt((1/N1) + (1/N2)))

where Sp = the pooled standard deviation of the two samples

or

Sp = sqrt(((N1 – 1) x (S1*S1) + (N2 – 1) x (S2*S2)) / (N1 + N2 – 2))

where

S1 = st dev of sample 1 (relevants) and S2 = st dev of sample 2 (comparisons)

and

S1*S1 = variance of sample 1 and S2*S2 = variance of sample 2
(this really sucks without subscripts and superscripts)
(sorry I don't know how to do math equations in a discussion thread)
here is a picture
http://www.raymondnelson.us/training/t-test_single_issue.jpg
Someone tell me if this is not right.
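For anyone who would rather check it in code than by hand, here is a sketch of the same computation in Python, using the hypothetical scores from the sketch above and cross-checking the hand-rolled result against scipy's built-in pooled t-test:

```python
import numpy as np
from scipy import stats

# The same hypothetical scores, flattened into two independent samples.
r = np.array([-2.1, -1.4, -0.9, -1.8, -0.6, -1.2, -2.4, -1.1, -0.7])
c = np.array([0.8, 1.3, 0.5, 1.1, 0.4, 0.9, 0.6, 1.0, 1.2])

n1, n2 = len(r), len(c)
s1, s2 = r.std(ddof=1), c.std(ddof=1)  # sample standard deviations

# Pooled standard deviation, Sp.
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Two-sample t statistic and its degrees of freedom.
t = (r.mean() - c.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Cross-check: scipy's pooled-variance t-test should produce the same t.
t_check, p_value = stats.ttest_ind(r, c, equal_var=True)
print(t, t_check, df, p_value)
```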
One of the robust features of this common statistical model is that it does not care how many relevant or comparison questions you have – 2/2, 2/3, 3/3, 3/4, 4/4, or 4/3 – and so it could be used to evaluate the level of significance for any single issue test, whether ZCT, B-Zone, Army MGQT, USAF-MGQT, USN-MGQT, or USSS-MGQT. In this model, more questions are better, because they increase the degrees of freedom in the t-distribution.
df = N1 + N2 - 2
It also allows the examiner to set the threshold alpha at which a decision is allowed: 0.1, 0.05, 0.01, or whatever you want.
Another advantage is that the t-test doesn't care how many charts you run – three, four, or five – more are better because of the improvement in df.
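The decision cutoff itself is just a table lookup. Here is a sketch using scipy's t-distribution (the alpha and df values are examples only):

```python
from scipy import stats

alpha = 0.05  # set your threshold: 0.1, 0.05, 0.01, or whatever you want
df = 16       # N1 + N2 - 2, e.g. nine relevant and nine comparison points

# Critical t values at this alpha and df.
print(stats.t.ppf(1 - alpha, df))      # one-tailed cutoff
print(stats.t.ppf(1 - alpha / 2, df))  # two-tailed cutoff

# More questions or more charts mean more degrees of freedom,
# and a larger df pulls the cutoff down toward the z value.
for df in (4, 10, 16, 28):
    print(df, stats.t.ppf(1 - alpha, df))
```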
And stay tuned (same bat-time, same bat-channel) for the never before attempted description of a statistical model for scoring mixed issue tests.
'nuff for now
time for a Guinness
r
------------------
"Gentlemen, you can't fight in here, this is the war room."
--(Peter Sellers as President Merkin Muffley in Stanley Kubrick's Dr. Strangelove, 1964)